Objectives

In this section we will learn how to generate graphics in R.

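The material presented here is largely based on:

  • workshop notes by Data Science Services at Harvard IQSS (http://tutorials.iq.harvard.edu/R/Rgraphics/Rgraphics.html)
  • Chapter 2 of Hadley Wickham's book on ggplot2 (http://ggplot2.org/book/qplot.pdf)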

Install and load packages

# Clear the workspace
rm(list = ls())
# List of packages needed for this workshop
reqpkg <- c("ggplot2", "ggrepel", "ggthemes", "grid", "gridExtra", 
            "RColorBrewer")
# Check if the packages are installed:
inpkg <- installed.packages()[, "Package"] # installed packages
neededpkg <- reqpkg[!reqpkg %in% inpkg]
if(length(neededpkg) > 0){
  # Collapse the names so the error is a single readable message
  stop("Need to install the following package(s): ",
       paste(neededpkg, collapse = ", "))
}
# Load all required packages and show version
for(i in reqpkg){
    print(paste(i, "version:", packageVersion(i)))
    library(i, quietly=TRUE, verbose=FALSE, warn.conflicts=FALSE, character.only=TRUE)
}
[1] "ggplot2 version: 2.1.0"
[1] "ggrepel version: 0.5.1"
[1] "ggthemes version: 3.2.0"
[1] "grid version: 3.3.0"
[1] "gridExtra version: 2.2.1"
[1] "RColorBrewer version: 1.1.2"
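
An alternative check, sketched here on the assumption that you would rather see every missing package reported than stop with an error, uses requireNamespace(), which returns FALSE instead of failing. The names available and missingpkg are introduced only for this illustration.

```r
# Packages required for this workshop (same list as above)
reqpkg <- c("ggplot2", "ggrepel", "ggthemes", "grid", "gridExtra",
            "RColorBrewer")

# requireNamespace() returns TRUE/FALSE without attaching the package
available <- vapply(reqpkg, requireNamespace, logical(1), quietly = TRUE)
missingpkg <- reqpkg[!available]

if (length(missingpkg) > 0) {
  message("Missing packages: ", paste(missingpkg, collapse = ", "))
  # install.packages(missingpkg)  # uncomment to install them from CRAN
} else {
  message("All required packages are installed.")
}
```

This variant only reports; the library() loop above is still needed to attach the packages.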

If you haven’t done that already, download the workshop materials:

setwd("/path/to/dir/")
download.file("/url/to/workshop/materials", "R_workshop.zip")
unzip("R_workshop.zip")

Introduction

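Instructions:

  • Ask any question any time if anything is unclear!
  • Collaboration is encouraged. Ask a friend! They might have dealt with the same issue just a minute ago.
  • Each of you should have 2 post-its to attach to your laptop:
      • red: if you can't figure out a task
      • green: everything is good

This course assumes that you:

  • have some experience with R already (e.g. the previous 2 workshop sections)
  • would like to learn how to create cool visualizations in R without going into much detail about the computations done behind the scenes.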

What is ggplot2?

ggplot2 is a plotting system for R, based on the grammar of graphics. It takes care of many of the fiddly details that make plotting a hassle (like drawing legends) as well as providing a powerful model of graphics that makes it easy to produce complex multi-layered graphics. [1]

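Advantages of ggplot2:

  • plots are defined at a high level of abstraction
  • plots are broken down into modules/layers
  • great flexibility when customizing your plot
  • good documentation
  • a large user base (easy access to help)

Weaknesses of ggplot2 (what the package should not be used for):

  • 3D graphics: see the rgl package or ggplot2 + plotly instead
  • graph/network plots with nodes and edges: see the igraph package instead
  • interactive graphics: see the ggvis or plotly packages instead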

What is the grammar of graphics?

It is a concept coined by Leland Wilkinson in 2005.

The basic idea: a plot is defined by independent building blocks, which combined create just about any kind of visualization you want.

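The building blocks of a graph include:

  • data
  • aesthetic mapping
  • geometric objects
  • statistical transformations
  • scales
  • coordinate system
  • positioning adjustments
  • faceting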

The structure of ggplot object

The ggplot() function is used to initialize the basic graph structure. It cannot produce the plot we want by itself. Instead, we need to add extra building blocks to it. The structure of a ggplot looks like this:

ggplot(data = <default data set>, 
       aes(x = <default x axis variable>,
           y = <default y axis variable>,
           ... <other default aesthetic mappings>),
       ... <other plot defaults>) +
  
  geom_<geom type>(aes(size = <size variable for this geom>, 
                       ... <other aesthetic mappings>),
                   data = <data for this point geom>,
                   stat = <statistic string or function>,
                   position = <position string or function>,
                   color = <"fixed color specification">,
                   ... <other arguments, possibly passed to the _stat_ function>) +

  scale_<aesthetic>_<type>(name = <"scale label">,
                           breaks = <where to put tick marks>,
                           labels = <labels for tick marks>,
                           ... <other options for the scale>) +

  theme(plot.background = element_rect(fill = "gray"),
        ... <other theme elements>)

This chunk of code might seem confusing, but by the end of this workshop you should be able to understand each of the components.

The basic idea is that you specify different parts of the plot, and add them together using the + operator.
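
As a concrete instance of the template, here is a minimal sketch that builds one plot step by step. It uses the mpg data set bundled with ggplot2, which is chosen only for illustration and is not part of the workshop materials; the axis title is likewise made up.

```r
library(ggplot2)

# Step 1: data and default aesthetic mappings; no layers yet
p <- ggplot(data = mpg, aes(x = displ, y = hwy))

# Step 2: add a geom, with an extra mapping used only by this layer
p <- p + geom_point(aes(color = class))

# Step 3: adjust a scale
p <- p + scale_x_continuous(name = "Engine displacement (litres)")

# Step 4: change the overall look
p <- p + theme_bw()

p  # printing the object draws the plot
```

Because each step returns a plot object, intermediate results can be stored and reused, which is how p1 and p2 are used later in this section.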

ggplot2 vs base graphics:

ggplot2 compared to base graphics is:

Example 1: History of unemployment

data("economics")
str(economics)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   574 obs. of  6 variables:
 $ date    : Date, format: "1967-07-01" "1967-08-01" "1967-09-01" "1967-10-01" ...
 $ pce     : num  507 510 516 513 518 ...
 $ pop     : int  198712 198911 199113 199311 199498 199657 199808 199920 200056 200208 ...
 $ psavert : num  12.5 12.5 11.7 12.5 12.5 12.1 11.7 12.2 11.6 12.2 ...
 $ uempmed : num  4.5 4.7 4.6 4.9 4.7 4.8 5.1 4.5 4.1 4.6 ...
 $ unemploy: int  2944 2945 2958 3143 3066 3018 2878 3001 2877 2709 ...
head(economics)
Source: local data frame [6 x 6]

        date   pce    pop psavert uempmed unemploy
      <date> <dbl>  <int>   <dbl>   <dbl>    <int>
1 1967-07-01 507.4 198712    12.5     4.5     2944
2 1967-08-01 510.5 198911    12.5     4.7     2945
3 1967-09-01 516.3 199113    11.7     4.6     2958
4 1967-10-01 512.9 199311    12.5     4.9     3143
5 1967-11-01 518.1 199498    12.5     4.7     3066
6 1967-12-01 525.8 199657    12.1     4.8     3018

Base graphics with a simple plot() function:

plot(unemploy/pop ~ date, data = economics,  type = "l")

A similar plot using ggplot2

ggplot(data = economics, aes(x = date, y = unemploy/pop)) + geom_line()

Note that ggplot() by itself does not plot the data.

ggplot(data = economics, aes(x = date, y = unemploy/pop))

You need to add a line geom.

ggplot(data = economics, aes(x = date, y = unemploy/pop)) + geom_line()

… and possibly change the background from the default gray to white with theme_bw().

ggplot(data = economics, aes(x = date, y = unemploy/pop)) + geom_line() +
  theme_bw()

What if we want to compare the trends in 2009 and 2014?

Add two variables: one for the year and one for the day of the year:

economics$dayOftheYear <- format(economics$date, format="%m-%d")
economics$dayOftheYear <- as.Date(economics$dayOftheYear, format="%m-%d")
economics$year <- format(economics$date, format="%Y")
head(economics)
Source: local data frame [6 x 8]

        date   pce    pop psavert uempmed unemploy dayOftheYear  year
      <date> <dbl>  <int>   <dbl>   <dbl>    <int>       <date> <chr>
1 1967-07-01 507.4 198712    12.5     4.5     2944   2016-07-01  1967
2 1967-08-01 510.5 198911    12.5     4.7     2945   2016-08-01  1967
3 1967-09-01 516.3 199113    11.7     4.6     2958   2016-09-01  1967
4 1967-10-01 512.9 199311    12.5     4.9     3143   2016-10-01  1967
5 1967-11-01 518.1 199498    12.5     4.7     3066   2016-11-01  1967
6 1967-12-01 525.8 199657    12.1     4.8     3018   2016-12-01  1967
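
The dates in dayOftheYear all fall in 2016 because as.Date() fills in the current year when the format string contains none, so the result depends on when the code is run. A minimal sketch of a variant that pins an explicit reference year instead (the year 2000 and the name dayRef are arbitrary choices for this illustration):

```r
dates <- as.Date(c("1967-07-01", "2009-07-01", "2014-12-01"))

# Prepend a fixed year to the month-day part, then parse as a full date
dayRef <- as.Date(paste0("2000-", format(dates, "%m-%d")))
dayRef

# The same calendar day in different years now maps to the same value
dayRef[1] == dayRef[2]
```

Either way, the year in the new column is only a placeholder that lets all years share one x axis.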

Using base graphics:

plot(unemploy/pop ~ dayOftheYear, data = subset(economics, year == 2009), 
     ylim = c(0.025, 0.05), type = "l")
lines(unemploy/pop ~ dayOftheYear, col = "red", data = subset(economics, year == 2014))
legend("topleft",
       c("2009", "2014"), title = "Year",
       col = c("black", "red"),
       lty = c(1, 1)) # line samples, matching the type = "l" plot

Using ggplot2:

ggplot(data = subset(economics, year %in% c(2014, 2009)), 
       aes(x = dayOftheYear, y = unemploy/pop)) + 
  geom_line(aes(color = year)) 

Note that there is no need to specify the legend! It is produced automatically in ggplot2.

It is even easy to plot all the years together:

ggplot(data = economics, aes(x = dayOftheYear, y = unemploy/pop)) + 
  geom_line(aes(color = year))

Example 2: diamonds dataset

We will now load the diamonds data set that is included with the ggplot2 package.

The data set contains the prices and other attributes of almost 54,000 diamonds. You can call ?diamonds to learn more about the available attributes.

data("diamonds")
str(diamonds)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   53940 obs. of  10 variables:
 $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
 $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
 $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
 $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
 $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
 $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
 $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
 $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
 $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
 $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
head(diamonds)
Source: local data frame [6 x 10]

  carat       cut  color clarity depth table price     x     y     z
  <dbl>    <fctr> <fctr>  <fctr> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  0.23     Ideal      E     SI2  61.5    55   326  3.95  3.98  2.43
2  0.21   Premium      E     SI1  59.8    61   326  3.89  3.84  2.31
3  0.23      Good      E     VS1  56.9    65   327  4.05  4.07  2.31
4  0.29   Premium      I     VS2  62.4    58   334  4.20  4.23  2.63
5  0.31      Good      J     SI2  63.3    58   335  4.34  4.35  2.75
6  0.24 Very Good      J    VVS2  62.8    57   336  3.94  3.96  2.48

It is easy to plot the distribution of diamond prices with base graphics

hist(diamonds$price)

as well as with ggplot2

ggplot(diamonds, aes(x = price)) + geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Now we take a random subset of the data to show the relationship between diamond weight (in carats; 1 carat = 200 mg) and price (in US dollars):

set.seed(12345) # Make the sample reproducible
dsmall <- diamonds[sample(nrow(diamonds), 200), ]

Base graphics:

# Named vector: one rainbow color per diamond color level present in dsmall
colorLevels <- levels(droplevels(dsmall$color))
colorMap <- setNames(rainbow(length(colorLevels)), colorLevels)
# Look colors up by level name (not by the factor's integer codes)
plot(price ~ carat, data = dsmall,
     col = unname(colorMap[as.character(dsmall$color)]))
legend(x = 'bottomright', 
       legend = names(colorMap),
       col = colorMap, pch = par("pch"), bty = 'n', xjust = 1)

ggplot2:

ggplot(data = dsmall, aes(x = carat, y = price, color = color)) + geom_point()

Geometric objects and Aesthetics

Geometric Objects (geom):

Geometric objects are the actual items we put on a plot. Examples include:

  • points (geom_point, for scatter plots, dot plots, etc)
  • lines (geom_line, for time series, trend lines, etc)
  • boxplot (geom_boxplot, for, well, boxplots!)

A plot must have at least one geom; there is no upper limit. You can add a geom to a plot using the + operator.

You can get a list of available geometric objects using the code below:

help.search("geom_", package = "ggplot2")

or simply type geom_<tab> in any good R IDE (such as RStudio or ESS) to see a list of functions starting with geom_.

Aesthetic Mapping

In ggplot an aesthetic mapping, defined with aes(), describes how variables are mapped to visual properties or aesthetics.

Examples of aesthetics are:

  • position (i.e., on the x and y axes)
  • color (“outside” color)
  • fill (“inside” color)
  • shape (of points)
  • linetype
  • size

Each type of geom accepts only a subset of all aesthetics. Refer to the geom help pages to see what mappings each geom accepts.

Scatter plots (geom_point)

p1 <- ggplot(dsmall, aes(x = carat, y = price))
p1 + geom_point()

p1 + geom_point(aes(color = color))

p1 + geom_point(aes(shape = cut))

p1 + geom_point(aes(shape = cut, color = color))

Aesthetic Mapping vs Assignment

Note that variables are mapped to aesthetics with the aes() function, while fixed aesthetics are set outside the aes() call.

This sometimes leads to confusion, as in this example:

ggplot(data = dsmall, aes(x = carat, y = price)) + 
  geom_point(aes(size = 2), # this is conceptually wrong since 2 is not a variable
             color = "darkgreen") # this is ok since you might want all points to be green.

ggplot(data = dsmall, aes(x = carat, y = price)) + 
  geom_point(aes(fill = cut), size = 2, color = "black", shape = 25)
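
A common slip is to put a constant inside aes(). The sketch below contrasts the two calls. In the first, "blue" is treated as a one-level data value and passed through the color scale, so a legend appears and the points are generally not drawn in blue. The second sets the color directly. The object names mapped and fixed are introduced only for this illustration.

```r
library(ggplot2)

set.seed(12345)
dsmall <- diamonds[sample(nrow(diamonds), 200), ]
p1 <- ggplot(dsmall, aes(x = carat, y = price))

# Mapping: "blue" is treated as data, not as a colour name
mapped <- p1 + geom_point(aes(color = "blue"))

# Setting: every point is drawn in blue and no legend is produced
fixed <- p1 + geom_point(color = "blue")

mapped
fixed
```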

Available shape configurations

## A look at all 25 symbols
df2 <- data.frame(x = 1:5 , y = 1:25, z = 1:25)
s <- ggplot(df2, aes(x = x, y = y))
s + geom_point(aes(shape = z), size = 3, colour = "blue") +
  scale_shape_identity()

## While all symbols have a foreground colour, symbols 21-25 also take a
## background colour (fill)
s + geom_point(aes(shape = z), size = 3, colour = "darkgreen", fill = "orange") +
  scale_shape_identity()

Data transformations

We can plot the transformed data by calling a function on the variable. For example here we show the log transformed data.

ggplot(dsmall, aes(x = log(carat), y = log(price))) + geom_point()

Text labels

Use an even smaller subset to avoid clutter:

set.seed(12345) # Make the sample reproducible
dsmall2 <- diamonds[sample(nrow(diamonds), 100), ]
p2 <- ggplot(dsmall2, aes(x = log(carat), y = log(price)))
p2 + geom_text(aes(label = color))

p2 + geom_label(aes(label = color))

The ggrepel package provides an easy way to place labels when points are densely packed.

library(ggrepel)
p2 + geom_point() + geom_text_repel(aes(label=color), size = 3)

It doesn’t work that well, though, when too many points are clustered together: the lines connecting labels to points extend too far away to make room for all the labels.

p1 + geom_point() + geom_text_repel(aes(label=color), size = 3)

In these cases you should choose to label only a subset of points.

set.seed(123456)
subsetData <- subset(dsmall, sample(c(TRUE, FALSE), nrow(dsmall), replace = TRUE,
                                    prob = c(0.2, 0.8)))
p1 + geom_point() + 
  geom_text_repel(data = subsetData, aes(label=color), size = 5, col = "Blue")

The Economist Data

For practice, you will try to recreate a plot published in the Economist issue of July 20th, 2016 reflecting the relationship between well-being and financial inclusion.

Graph source: http://www.economist.com/blogs/graphicdetail/2016/07/daily-chart-13

You will generate this figure step by step through a series of included exercises using the tools we’ve just learned and will learn about.

The data for the exercises is available in the data/EconomistData.csv file. Read it in with the following commands:

dat <- read.csv("./data/EconomistData.csv")
head(dat)
    Country SEDA.Current.level SEDA.Recent.progress Wealth.to.well.being.coefficient Growth.to.well.being.coefficient
1   Albania               50.0                 63.3                             1.27                             1.31
2   Algeria               40.6                 46.5                             0.87                             1.03
3    Angola               17.8                 76.2                             0.54                             1.21
4 Argentina               54.1                 49.1                             0.91                             0.89
5   Armenia               43.8                 46.0                             1.25                             1.11
6 Australia               87.9                 40.9                             1.07                             0.92
  Percent.of.15plus.with.bank.account                  EPI_regions                        Region
1                            37.98635    Central and Eastern Europ                        Europe
2                            50.47579 Middle East and North Africa    Middle East & North Africa
3                            29.31812           Sub-Saharan Africa            Sub-Saharan Africa
4                            50.19730    Latin America and Caribbe Latin America & the Caribbean
5                            17.66907 Middle East and North Africa    Middle East & North Africa
6                            98.85957    East Asia and the Pacific                       Oceania

The original sources for this data are:

The assignment of countries to regions is based on the EPI_regions column in the countryExData data.frame from the rworldmap package. The Region variable was matched with the categories in the Economist plot.

Exercise I

For the EconomistData.csv do the following:

  1. Create a scatter plot with percent of people over the age of 15 with a bank account on the x axis and the SEDA score on the y axis.
  2. Color the points in the previous plot blue.
  3. Color the points in the previous plot according to the Region.
  4. Create boxplots of SEDA scores by Region.
  5. Overlay points on top of the box plots
# (...?)

Statistical Transformations

Now, we will go back to the diamonds data set and return to the Economist plot later.

So far we have only dealt with (x,y) types of plots (scatter plots or line plots), where each point has its own (x,y) coordinates.

Sometimes, however, we are more interested in plots that require some statistical transformations. The transformations might map a raw datapoint or a group of datapoints to other values. Examples of plots involving statistical transformations:

These types of plots require some statistical transformations. For example:

Boxplots and jittered points

ggplot(data = diamonds, aes(x = color, y =price/carat)) +
  geom_jitter()

j1 <- ggplot(data = diamonds, aes(x = color, y =price/carat)) +
  geom_jitter(alpha = I(1/5))
j2 <- ggplot(data = diamonds, aes(x = color, y =price/carat)) +
  geom_jitter(alpha = I(1/50))
j3 <- ggplot(data = diamonds, aes(x = color, y =price/carat)) +
  geom_jitter(alpha = I(1/200))
grid.arrange(j1, j2, j3, ncol = 3)

Here we used grid.arrange() from the gridExtra package to display multiple plots side by side in one row.

Sometimes less is more…

ggplot(data = diamonds, aes(x = color, y =price/carat)) +
  geom_boxplot()

Histogram and density plots

Below we plot the distribution of the weights (carat) of the diamonds.

h <- ggplot(data = diamonds, aes(x = carat)) + geom_histogram()
d <- ggplot(data = diamonds, aes(x = carat)) + geom_density()
grid.arrange(h, d, ncol = 2)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  • For the density plot, the adjust argument controls the degree of smoothness (high values of adjust produce smoother plots).
  • For the histogram, the binwidth or bins argument controls the amount of smoothing by setting the bin size or the number of bins. (Break points can also be specified explicitly, using the breaks argument.)
p <- ggplot(data = diamonds, aes(x = carat)) + xlim(0, 3)
h1 <- p + geom_histogram(binwidth = 1) 
h2 <- p + geom_histogram(binwidth = 0.1) 
h3 <- p + geom_histogram(binwidth = 0.01) 
grid.arrange(h1, h2, h3, ncol = 3)
Warning: Removed 32 rows containing non-finite values (stat_bin).
Warning: Removed 32 rows containing non-finite values (stat_bin).
Warning: Removed 32 rows containing non-finite values (stat_bin).

d1 <- p + geom_density(adjust = 5) 
d2 <- p + geom_density(adjust = 1) 
d3 <- p + geom_density(adjust = 1/5) 
grid.arrange(d1, d2, d3, ncol = 3)
Warning: Removed 32 rows containing non-finite values (stat_density).
Warning: Removed 32 rows containing non-finite values (stat_density).
Warning: Removed 32 rows containing non-finite values (stat_density).

The histograms can be broken down into groups. Here we show grouping by diamond cut.

h <- p + geom_histogram(aes(fill = cut), position = "dodge", bins = 10)
d <- p + geom_density(aes(color = cut))
grid.arrange(h, d, ncol = 2)
Warning: Removed 32 rows containing non-finite values (stat_bin).
Warning: Removed 4 rows containing missing values (geom_bar).
Warning: Removed 32 rows containing non-finite values (stat_density).

Instead of the marginal distribution, we can stack the components on top of each other to see the contribution from each group.

h <- p + geom_histogram(aes(fill = cut), position = "stack")
d <- p + geom_density(aes(fill = cut), position = "stack")
grid.arrange(h, d, ncol = 2)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 32 rows containing non-finite values (stat_bin).
Warning: Removed 32 rows containing non-finite values (stat_density).

Bar charts

The discrete analogue of the histogram is the bar chart, geom_bar(). Instead of partitioning the values into bins as a histogram does, the bar geom counts the number of instances of each discrete class. The counts are then plotted as bars, one per class.

If you’d like to tabulate class members in some other way (rather than count), e.g. by summing up a continuous variable, you can use the weight aesthetic.

b1 <- ggplot(diamonds, aes(x = color)) + geom_bar()
b2 <- ggplot(diamonds, aes(x = color)) + geom_bar(aes(weight = carat)) + ylab("carat")
grid.arrange(b1, b2, ncol = 2)

The left plot shows counts and the right plot is the count weighted by weight = carat to show the total weight of diamonds of each color.

Thus, you don’t need to tabulate your values beforehand, as you do with barplot() in base R. However, if you have already summarized your data, you can still use geom_bar() with a different statistical transformation, stat = "identity", rather than the default stat = "count".

diamonds.mean <- aggregate(diamonds["carat"], diamonds["color"], FUN=mean)
rbind(head(diamonds.mean), tail(diamonds.mean))
   color     carat
1      D 0.6577948
2      E 0.6578667
3      F 0.7365385
4      G 0.7711902
5      H 0.9117991
6      I 1.0269273
21     E 0.6578667
31     F 0.7365385
41     G 0.7711902
51     H 0.9117991
61     I 1.0269273
7      J 1.1621368

The default option will generate an error:

ggplot(diamonds.mean, aes(x=color, y=carat)) + 
  geom_bar()
Error: stat_count() must not be used with a y aesthetic.

Thus, you need to use the following:

ggplot(diamonds.mean, aes(x=color, y=carat)) + 
  geom_bar(stat="identity")

diamonds.sum <- aggregate(diamonds["carat"], diamonds["color"], FUN=sum)
ggplot(diamonds.sum, aes(x=color, y=carat)) +  geom_bar(stat="identity")

Note that this is the same plot as the one generated with the weight aesthetic, which is exactly what we should expect.
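
The equivalence can also be checked numerically. The sketch below is an illustration, not part of the original workshop: it pulls the bar heights that ggplot2 computed with layer_data() and compares them with totals computed by hand. The names weighted and carat_sum are introduced only for this example.

```r
library(ggplot2)

# Bar heights computed by geom_bar() with the weight aesthetic
weighted <- layer_data(ggplot(diamonds, aes(x = color)) +
                         geom_bar(aes(weight = carat)))
weighted <- weighted[order(weighted$x), ]

# Total carat per color, computed by hand in factor-level order
carat_sum <- tapply(diamonds$carat, diamonds$color, sum)

# Compare the two up to floating-point tolerance
all.equal(weighted$count, as.numeric(carat_sum))
```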

Prediction lines

We can add a regression line to the plot by adding a line through the fitted y-values from a prediction model:

dsmall$pred.price <- predict(lm(price ~ carat, data = dsmall))
p1 <- ggplot(dsmall, aes(x = carat, y = price))
p1 + geom_point(aes(color = color)) + geom_line(aes(y = pred.price))

Smoothers

If you have a scatterplot with many data points, it can be hard to see exactly what trend is shown by the data. In this case you may want to add a smoothed line to the plot. The smooth geom includes a line and a ribbon.

ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point() + geom_smooth()

For our small subset of the diamonds data set we have:

p1 + geom_point() + geom_smooth()

By changing the span argument, we can obtain a more or less wiggly curve (a smaller span results in more wiggliness).

grid.arrange(p1 + geom_point() + geom_smooth(span = 0.2),
             p1 + geom_point() + geom_smooth(span = 0.7), ncol = 2)

  • The default method used in geom_smooth with a small number of observations (n < 1000) is method = "loess", which fits a smooth local regression. More details about the algorithm can be found in ?loess.
  • Loess does not work well for large datasets (it’s \(O(n^2)\) in memory), and so an alternative smoothing algorithm is used when n is greater than 1,000.

Exercise II

  1. Re-create a scatter plot with percent of people aged 15+ with a bank account on the x axis and SEDA current level score on the y axis (as you did in the previous exercise).
  2. Overlay a smoothing line on top of the scatter plot using the lm method. Hint: see ?stat_smooth.
  3. Overlay a smoothing line on top of the scatter plot using the default method.
  4. Overlay a smoothing line on top of the scatter plot using the default loess method, but make it less smooth. Hint: see ?loess.
# (...?)

Scales

Aesthetic Mapping Variable Scaling

Aesthetic mapping (i.e., with aes()) is responsible for assigning an aesthetic to a variable. It does not, however, specify how the mapping should be done.

For example, aes(shape = x) or aes(color = z) do not specify what shapes or what colors should be used. To choose colors/shapes/sizes etc. you need to modify the corresponding scale.

In ggplot2 scales include:

  • position
  • color and fill
  • size
  • shape
  • line type

Scales are modified with a series of functions using a scale_<aesthetic>_<type> naming scheme. Try typing scale_<tab> to see a list of scale modification functions.

Common Scale Arguments:

  • name: the first argument gives the axis or legend title
  • limits: the minimum and maximum of the scale
  • breaks: the points along the scale where labels should appear
  • labels: the labels that appear at each break
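
A minimal sketch of the four common arguments used together on the y axis. The axis title, limits, and tick labels are arbitrary values chosen for this illustration, and ps is a name introduced only here.

```r
library(ggplot2)

set.seed(12345)
dsmall <- diamonds[sample(nrow(diamonds), 200), ]

ps <- ggplot(dsmall, aes(x = carat, y = price)) +
  geom_point() +
  scale_y_continuous(name = "Price (US dollars)",             # axis title
                     limits = c(0, 20000),                    # axis range
                     breaks = c(0, 5000, 10000, 15000, 20000), # tick positions
                     labels = c("0", "5k", "10k", "15k", "20k")) # tick labels
ps
```

Note that breaks and labels must have the same length, since each label is attached to one break.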

Scale: axes

Square root transformation of the y-axis:

p1 <- ggplot(dsmall, aes(x = carat, y = price)) 
p1 + geom_point() + scale_y_sqrt()

Log base 10 transformation of the y-axis:

p1 + geom_point() + scale_y_log10()

Log base 10 transformation of x and y axes:

p1 + geom_point() + scale_y_log10() + scale_x_log10()

Note that the above produces the same points as:

ggplot(dsmall, aes(x = log(carat), y = log(price))) + geom_point()

but with different values on the axes.
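
The point pattern is the same because natural and base-10 logarithms differ only by a constant factor. A small check in base R, using made-up carat values:

```r
carat <- c(0.25, 0.5, 1, 2)

# Natural log and base-10 log differ only by the constant factor log(10)
all.equal(log(carat), log10(carat) * log(10))

# scale_x_log10() labels the ticks in the original units (carat), while
# aes(x = log(carat)) labels them with the logged values
data.frame(carat = carat, log10 = log10(carat), ln = log(carat))
```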

Scale: shapes

p1 + geom_point(aes(shape = cut), size = 3)  

p1 + geom_point(aes(shape = cut), size = 3) + 
  scale_shape_manual(values = c(1:5))

Scale: colors

To choose specific colors for discrete variables we can use scale_color_manual

p1 + geom_point(aes(color = cut), size = 3) + 
  scale_color_manual(values = c("red", "orange", "yellow", "green", "blue"))

For continuous variables you can also use scale_color_gradient, and specify the ends of the spectrum:

p1 + geom_point(aes(color = price), size = 3) + 
  scale_color_gradient(low = "blue", high = "red")

scale_color_brewer is a very useful function that can be used to set colors for discrete variables. It gives you a choice between many predefined and pretty color palettes.

p1 + geom_point(aes(color = cut), size = 3) + 
  scale_color_brewer(palette = "Set2")

Unfortunately scale_color_brewer doesn’t work for continuous variables:

p1 + geom_point(aes(color = price), size = 3) + 
  scale_color_brewer(palette = "Spectral")
Error: Continuous value supplied to discrete scale

Thankfully, we can get around this issue using the RColorBrewer package and using scale_color_gradientn:

library(RColorBrewer)
p1 + geom_point(aes(color = price), size = 3) + 
  scale_color_gradientn(colours = brewer.pal(name = "Spectral", n = 10))
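
As another option, ggplot2 itself provides scale_color_distiller(), which interpolates a ColorBrewer palette into a continuous gradient. A minimal sketch, offered as an alternative rather than as the workshop's own solution (pd is a name introduced only here):

```r
library(ggplot2)

set.seed(12345)
dsmall <- diamonds[sample(nrow(diamonds), 200), ]

# Continuous colour scale built from the "Spectral" ColorBrewer palette
pd <- ggplot(dsmall, aes(x = carat, y = price)) +
  geom_point(aes(color = price), size = 3) +
  scale_color_distiller(palette = "Spectral")
pd
```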

… and if you are a true indie person, you can even check out the color schemes based on Wes Anderson movies:

#install.packages("wesanderson")
library(wesanderson)
names(wes_palettes)
 [1] "GrandBudapest"  "Moonrise1"      "Royal1"         "Moonrise2"      "Cavalcanti"     "Royal2"         "GrandBudapest2"
 [8] "Moonrise3"      "Chevalier"      "Zissou"         "FantasticFox"   "Darjeeling"     "Rushmore"       "BottleRocket"  
[15] "Darjeeling2"   

For discrete:

p1 + geom_point(aes(color = cut), size = 3) + 
  scale_color_manual(values = wes_palette("Darjeeling", n = 5))

For continuous:

p1 + geom_point(aes(color = price), size = 3) + 
  scale_color_gradientn(colours = wes_palette("Darjeeling", 100, type = "continuous"))

You can also scale the values of the variable corresponding to color.

p1 + geom_point(aes(color = price), size = 3) + 
  scale_color_gradient(low = "blue", high = "red", trans = "log10")

and add your own breaks

p1 + geom_point(aes(color = price), size = 3) + 
  scale_color_gradient(low = "blue", high = "red", trans = "log10",
                       breaks = c(1000, 2000, 5000, 10000),
                       labels = c("  1000", "  2000", "  5000", "10000")) 
  

Exercise III

  1. For the scatter plot of the percent of people aged 15+ with a bank account vs the SEDA score, colored by region (generated in Exercise I.3), modify the color scale to use specific values of your choosing. Hint: see ?scale_color_manual.
# (...?)

Faceting

p <-  ggplot(diamonds, aes(x = carat))
p + geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

p + geom_histogram() + facet_wrap(~ color)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

p + geom_histogram() + facet_grid(cut ~ color)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

(p <- ggplot(data = diamonds[sample(nrow(diamonds), 1000), ], 
             aes(x = carat, y = price)) +
  geom_point(aes(text = paste("Clarity:", clarity)), size = 1) +
  geom_smooth(aes(colour = cut, fill = cut)) + facet_wrap(~ cut))
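
In the calls above, facet_wrap() lays out one panel per level of a single variable, while facet_grid() arranges panels in rows and columns defined by two variables. The sketch below shows two optional facet_wrap() arguments, ncol and scales; the particular values are arbitrary choices for illustration, and pf is a name introduced only here.

```r
library(ggplot2)

pf <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.1) +
  # one panel per cut, two columns, each panel with its own y-axis range
  facet_wrap(~ cut, ncol = 2, scales = "free_y")
pf
```

Free scales make the shape of each panel easier to see, at the cost of making heights harder to compare across panels.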

Exercise IV

  1. Facet the Economist plot from Exercise III by region (~ Region).
# (... ?)

Finish the Economist plot.

To complete the graph we need to:

Change order of the Regions

dat$Region <- as.character(dat$Region)
dat$Region <- factor(dat$Region, 
                     levels = c("Europe", "Asia", "Oceania", 
                                "North America", 
                                "Latin America & the Caribbean", 
                                "Middle East & North Africa",
                                "Sub-Saharan Africa"),
                     labels = c("Europe", "Asia", "Oceania", 
                                "North America", 
                                "Latin America & \n the Caribbean", 
                                "Middle East & \n North Africa",
                                "Sub-Saharan \n Africa"))
ggplot(dat, aes(Percent.of.15plus.with.bank.account, SEDA.Current.level)) + 
  geom_point(aes(color = Region))

Add the linear trend

# (...?)

Change the axes ratio.

Hint: see ?coord_fixed.

# (...?)

Change the color scheme

Use the following colors c("#28AADC","#F2583F", "#76C0C1","#24576D", "#248E84","#DCC3AA", "#96503F")

# (...?)

Add a title and format the axes

Check ?scale_x_continuous and ?scale_y_continuous documentation. To add a title use ggtitle().

# (...?)

Change the background and theme

You can check out the ggthemes package, which implements themes that make your plots look like they came from:

  • Base graphics
  • Tableau
  • Excel
  • Stata
  • the Economist
  • Wall Street Journal
  • Edward Tufte
  • Nate Silver’s Fivethirtyeight
  • etc.

Use the one that mimics the Economist.

library(ggthemes)
# (...?)

Format the legend

Use theme() with arguments such as legend.position, legend.direction, legend.text, and plot.margin.

# (...?)

Add point labels

Add labels to the following subset of the countries:

pointsToLabel <- c("Yemen", "Iraq", "Egypt", "Jordan", "Chad", "Congo", 
                   "Angola", "Albania", "Zimbabwe", "Uganda", "Nigeria",
                   "Uruguay", "Kazakhstan", "India", "Turkey", "South Africa",
                   "Kenya", "Russia", "Brazil", "Chile", "Saudi Arabia", 
                   "Poland", "China", "Serbia", "United States", "United Kingdom")

Use geom_text_repel()

# (...?)

Add notes to the bottom and save the plot

Use grid.text()

# (...?)

plotly for interactive plotting

You can also easily generate an interactive plot by calling the ggplotly() function from the plotly package.

#install.packages("plotly")
library(plotly)
#ggplotly(pEconomist)

Similar to the original:

What we have learned so far:

Additional resources:


  1. http://ggplot2.org/

  2. https://www.bcgperspectives.com/content/articles/growth-globalization-private-sector-opportunity-improve-well-being-2016-economic-development-assessment/

---
title: "Generating graphics with R"
output: html_notebook
---

# Objectives

In this section we will learn how to generate graphics in R.

The material presented here is largely based on:

* [workshop notes](http://tutorials.iq.harvard.edu/R/Rgraphics/Rgraphics.html) by Data Science Services at Harvard IQSS 
* and [Chapter 2](http://ggplot2.org/book/qplot.pdf) from Hadley Wickham's book on ggplot2.

# Install and load packages

```{r check-install-load-packages, warning=FALSE, message=FALSE}
# Clear the workspace
rm(list = ls())

# List of packages needed for this workshop
reqpkg <- c("ggplot2", "ggrepel", "ggthemes", "grid", "gridExtra", 
            "RColorBrewer")

# Check if the packages are installed:
inpkg <- installed.packages()[, "Package"] # installed packages
neededpkg <- reqpkg[!reqpkg %in% inpkg]
if(length(neededpkg) > 0){
  # Collapse the names so the error is a single readable message
  stop("Need to install the following package(s): ",
       paste(neededpkg, collapse = ", "))
}

# Load all required packages and show version
for(i in reqpkg){
	print(paste(i, "version:", packageVersion(i)))
	library(i, quietly=TRUE, verbose=FALSE, warn.conflicts=FALSE, character.only=TRUE)
}
```

If you haven't done that already, download the workshop materials:
```{r eval = FALSE}
setwd("/path/to/dir/")
download.file("/url/to/workshop/materials", "R_workshop.zip")
unzip("R_workshop.zip")
```


# Introduction

Instructions:

* Ask *any* question *any* time if *any*thing is unclear!
* Collaboration is encouraged. Ask a friend! They might have dealt with the same
issue just a minute ago.
* Each of you should have 2 post-its to attach to your laptop:
    + <span style="color:red">red</span> -- if you can't figure out a task
    + <span style="color:green">green</span> -- everything is good


This course assumes that you:

* have some experience with R already (e.g. the previous 2 workshop sections)
* would like to learn how to create cool visualizations in R without 
going into much detail about the computations done behind the scenes.


# What is `ggplot2`?

> `ggplot2` is a plotting system for R, based on the grammar of graphics. It takes care of many of the fiddly details that make plotting a hassle (like drawing legends) as well as providing a powerful model of graphics that makes it easy to produce complex multi-layered graphics. ^[http://ggplot2.org/]

Advantages of `ggplot2`:

* plots are defined at a high level of abstraction,
* plots are broken down into modules/layers
* great flexibility when *customizing* your plot
* good documentation 
* a large user base (easy access to help)

Weaknesses of `ggplot2` (what the package should not be used for):

* 3D graphics: see [`rgl`](https://cran.r-project.org/web/packages/rgl/vignettes/rgl.html) 
package instead or [`ggplot2` + `plotly`](http://blog.revolutionanalytics.com/2014/11/3-d-plots-with-plotly.html)
* graph/network plots with nodes and edges: see [`igraph`](http://igraph.org/r/) package instead
* interactive graphics: see [`ggvis`](http://ggvis.rstudio.com/ggvis-basics.html), 
[`plotly`](https://github.com/ropensci/plotly) packages instead


# What is the grammar of graphics?

It is a concept introduced by Leland Wilkinson in his book *The Grammar of Graphics* (first edition 1999, second edition *2005*).

**The basic idea:** a plot is defined by independent building blocks, which
combined create just about any kind of visualization you want. 

The building blocks of a graph include:

* data
* aesthetic mapping
* geometric objects
* statistical transformations
* scales
* coordinate system
* positioning adjustments
* faceting

# The structure of ggplot object

The `ggplot()` function is used to initialize the basic graph structure.
It cannot produce the plot we want by itself. Instead, we need to add extra 
building blocks to it. The structure of a ggplot looks like this:

```{r, eval = FALSE}
ggplot(data = <default data set>, 
       aes(x = <default x axis variable>,
           y = <default y axis variable>,
           ... <other default aesthetic mappings>),
       ... <other plot defaults>) +
  
  geom_<geom type>(aes(size = <size variable for this geom>, 
                       ... <other aesthetic mappings>),
                   data = <data for this point geom>,
                   stat = <statistic string or function>,
                   position = <position string or function>,
                   color = <"fixed color specification">,
                   ... <other arguments, possibly passed to the _stat_ function>) +

  scale_<aesthetic>_<type>(name = <"scale label">,
                           breaks = <where to put tick marks>,
                           labels = <labels for tick marks>,
                           ... <other options for the scale>) +

  theme(plot.background = element_rect(fill = "gray"),
        ... <other theme elements>)
```

This chunk of code might seem confusing, but by the end of this workshop
you should be able to understand each of the components.

**The basic idea** is that you *specify different parts* of the plot, 
and *add them together* using the + operator.
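
As a minimal sketch of this additive style, each component can be stored in a variable and combined later. The built-in `mtcars` data set and the object names `base`, `points`, `x_scale` and `p` are chosen here purely for illustration and are not part of the workshop materials.

```{r}
library(ggplot2)

# Start from an empty canvas that only knows the data and the default mappings
base <- ggplot(data = mtcars, aes(x = wt, y = mpg))

# Layers and scales are ordinary R objects ...
points <- geom_point()
x_scale <- scale_x_continuous(name = "Weight (1000 lbs)")

# ... that are combined with `+`
p <- base + points + x_scale + theme_bw()
p
```

Printing `base` on its own would show only empty axes, because no geom has been added to it yet.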


# `ggplot2` vs base graphics:

Compared to base graphics, `ggplot2`:

* is more verbose for simple / *out of the box* graphics
* is less verbose for complex / custom graphics
* uses a different system for adding plot elements
(the `+` operator instead of calls to additional functions like 
`points()`, `lines()` etc.)

## Example 1: History of unemployment

```{r, fig.width=10, fig.height=6}
data("economics")
str(economics)
```

```{r}
head(economics)
```

Base graphics with a simple `plot()` function:

```{r}
plot(unemploy/pop ~ date, data = economics,  type = "l")
```

A similar plot using `ggplot2`

```{r}
ggplot(data = economics, aes(x = date, y = unemploy/pop)) + geom_line()
```

Note that `ggplot()` by itself does not plot the data. 
```{r}
ggplot(data = economics, aes(x = date, y = unemploy/pop))
```
You need to add the *lines* object.

```{r}
ggplot(data = economics, aes(x = date, y = unemploy/pop)) + geom_line()
```

... and possibly change the background color from 
<span style="color:gray">default gray</span> to 
<span style="color: white; background:gray">customized white</span>. 

```{r}
ggplot(data = economics, aes(x = date, y = unemploy/pop)) + geom_line() +
  theme_bw()
```

## What if we want to compare the trend from year 2009 to 2014?

Add two variables, one for the year and one for the day of the year: 
```{r}
economics$dayOftheYear <- format(economics$date, format="%m-%d")
economics$dayOftheYear <- as.Date(economics$dayOftheYear, format="%m-%d")
economics$year <- format(economics$date, format="%Y")
head(economics)
```
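
The conversion above relies on how `as.Date()` handles an incomplete format: when the format string contains no year, the current year is filled in, so observations from different years land on a common calendar axis. A small sketch on a toy vector (the names `d`, `md`, `same_axis` and `yr` are ours, used only for illustration):

```{r}
# Two dates from different years
d <- as.Date(c("2009-03-01", "2014-03-01"))

# Drop the year, then parse back: the missing year defaults to the current one
md <- format(d, format = "%m-%d")
same_axis <- as.Date(md, format = "%m-%d")

md                            # month-day strings, identical for both dates
same_axis[1] == same_axis[2]  # both dates now fall on the same day

# The year is kept separately so it can be used as a grouping variable
yr <- format(d, format = "%Y")
yr
```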

Using base graphics:

```{r}
plot(unemploy/pop ~ dayOftheYear, data = subset(economics, year == 2009), 
     ylim = c(0.025, 0.05), type = "l")
lines(unemploy/pop ~ dayOftheYear, col = "red", data = subset(economics, year == 2014))
legend("topleft",
       c("2009", "2014"), title = "Year",
       col = c("black", "red"),
       lty = c(1, 1)) # line samples, since both series are drawn as lines
```

Using `ggplot2`:

```{r}
ggplot(data = subset(economics, year %in% c(2014, 2009)), 
       aes(x = dayOftheYear, y = unemploy/pop)) + 
  geom_line(aes(color = year)) 
```

Note that there is no need to specify the legend! It is produced automatically
in `ggplot2`.

It is even easy to plot all the years together:

```{r fig.height=5}
ggplot(data = economics, aes(x = dayOftheYear, y = unemploy/pop)) + 
  geom_line(aes(color = year))
```

## Example 2: diamonds dataset

We will now load the `diamonds` data set that is included with the `ggplot2` 
package. 

The data set contains the prices and other attributes of almost 54,000 
diamonds. You can call `?diamonds` to learn more about the available attributes.

```{r}
data("diamonds")
str(diamonds)
```

```{r}
head(diamonds)
```

It is easy to plot the distribution of diamond prices with base graphics

```{r}
hist(diamonds$price)
```

as well as with `ggplot2`

```{r}
ggplot(diamonds, aes(x = price)) + geom_histogram()
```

Now we subset the data to show the relationship between the diamonds'
weights (1 carat = 200 mg) and their prices ($): 

```{r}
set.seed(12345) # Make the sample reproducible
dsmall <- diamonds[sample(nrow(diamonds), 200), ]
```

Base graphics:

```{r}
# Named vector: one rainbow color per level of the `color` factor
colorMap <- rainbow(nlevels(dsmall$color))
names(colorMap) <- levels(dsmall$color)

# Index by name (not by factor code) so that points and legend agree
plot(price ~ carat, data = dsmall, col = colorMap[as.character(dsmall$color)])
legend(x = 'bottomright', 
       legend = names(colorMap),
       col = colorMap, pch = par("pch"), bty = 'n', xjust = 1)
```

`ggplot2`:

```{r}
ggplot(data = dsmall, aes(x = carat, y = price, color = color)) + geom_point()
```


# Geometric objects and Aesthetics

## Geometric Objects (`geom`):

Geometric objects are the actual items we put on a plot. Examples include:

* points (geom_point, for scatter plots, dot plots, etc)
* lines (geom_line, for time series, trend lines, etc)
* boxplot (geom_boxplot, for, well, boxplots!)

A plot must have at least one geom; there is no upper limit. 
You can add a geom to a plot using the `+` operator.
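
As a small illustration of stacking several geoms, the sketch below draws a line and points from the same data and mapping. It uses only the last two years of the `economics` data so the points stay readable; `recent` and `p_layers` are names made up for this example.

```{r}
library(ggplot2)

# Keep the 24 most recent monthly observations
recent <- tail(economics, 24)

# Two geoms share the same data and aesthetic mapping
p_layers <- ggplot(recent, aes(x = date, y = unemploy / pop)) +
  geom_line() +
  geom_point()
p_layers
```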

You can get a list of available geometric objects using the code below:

```{r}
help.search("geom_", package = "ggplot2")
```

or simply type `geom_<tab>` in any good R IDE (such as RStudio or ESS) to see
a list of functions starting with `geom_`.


## Aesthetic Mapping

> In ggplot an *aesthetic mapping*, defined with aes(), describes how variables 
are mapped to visual properties or aesthetics.

Examples of aesthetics are: 

* position (i.e., on the x and y axes)
* color ("outside" color)
* fill ("inside" color)
* shape (of points)
* linetype
* size

Each type of geom accepts only a subset of all aesthetics. Refer to
the geom help pages to see what mappings each geom accepts. 
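
Besides the help pages, one way to compare geoms is to query the `Geom*` objects that `ggplot2` exports. The sketch below assumes their `aesthetics()` method, which lists the aesthetics a geom understands; the variable names are ours.

```{r}
library(ggplot2)

# Aesthetics understood by points and by lines
point_aes <- GeomPoint$aesthetics()
line_aes <- GeomLine$aesthetics()

setdiff(point_aes, line_aes)  # accepted by geom_point() only, e.g. "shape"
setdiff(line_aes, point_aes)  # accepted by geom_line() only, e.g. "linetype"
```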

## Scatter plots (`geom_point`)


```{r}
p1 <- ggplot(dsmall, aes(x = carat, y = price))
p1 + geom_point()
```

```{r}
p1 + geom_point(aes(color = color))
```

```{r}
p1 + geom_point(aes(shape = cut))
```

```{r}
p1 + geom_point(aes(shape = cut, color = color))
```

## Aesthetic Mapping vs Assignment

Note that variables are mapped to aesthetics with the `aes()` function, 
while fixed aesthetics are set outside the `aes()` call. 

This sometimes leads to confusion, as in this example:

```{r}
ggplot(data = dsmall, aes(x = carat, y = price)) + 
  geom_point(aes(size = 2), # this is conceptually wrong since 2 is not a variable
             color = "darkgreen") # this is ok since you might want all points to be green.
```

```{r}
ggplot(data = dsmall, aes(x = carat, y = price)) + 
  geom_point(aes(fill = cut), size = 2, color = "black", shape = 25)
```
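
The difference between mapping and setting can be made concrete by looking at the colors `ggplot2` actually draws. This is a sketch on the built-in `mtcars` data, with `p_mapped` and `p_set` as illustrative names; `ggplot_build()` returns the computed data for each layer.

```{r}
library(ggplot2)

# Mapping: "blue" is treated as a one-level variable, so the color scale
# picks its own default color and a legend appears
p_mapped <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(color = "blue"))

# Setting: every point is drawn in the literal color blue, with no legend
p_set <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "blue")

# Inspect the colors that are actually drawn
mapped_col <- unique(ggplot_build(p_mapped)$data[[1]]$colour)
set_col <- unique(ggplot_build(p_set)$data[[1]]$colour)
mapped_col
set_col
```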


## Available shape configurations

```{r}
## A look at all 25 symbols
df2 <- data.frame(x = 1:5 , y = 1:25, z = 1:25)
s <- ggplot(df2, aes(x = x, y = y))
s + geom_point(aes(shape = z), size = 3, colour = "blue") +
  scale_shape_identity()
```


```{r}
## While all symbols have a foreground colour, symbols 21-25 also take a
## background colour (fill)
s + geom_point(aes(shape = z), size = 3, colour = "darkgreen", fill = "orange") +
  scale_shape_identity()
```


## Data transformations

We can plot the transformed data by calling a function on the variable.
For example here we show the log transformed data.

```{r}
ggplot(dsmall, aes(x = log(carat), y = log(price))) + geom_point()
```

## Text labels

Use an even smaller subset to avoid clutter:
```{r}
set.seed(12345) # Make the sample reproducible
dsmall2 <- diamonds[sample(nrow(diamonds), 100), ]
```

```{r}
p2 <- ggplot(dsmall2, aes(x = log(carat), y = log(price)))
p2 + geom_text(aes(label = color))
```

```{r}
p2 + geom_label(aes(label = color))
```

The `ggrepel` package gives an easy way to place labels when points are densely
packed. 

```{r}
library(ggrepel)
p2 + geom_point() + geom_text_repel(aes(label=color), size = 3)
```

It doesn't work that well, though, if too many points are clustered
together: the lines connecting labels to points then extend too far away
to make room for all the labels. 


```{r}
p1 + geom_point() + geom_text_repel(aes(label=color), size = 3)
```

In these cases you should choose to label
only a subset of points.

```{r}
set.seed(123456)
subsetData <- subset(dsmall, sample(c(TRUE, FALSE), nrow(dsmall), replace = TRUE,
                                    prob = c(0.2, 0.8)))
p1 + geom_point() + 
  geom_text_repel(data = subsetData, aes(label=color), size = 5, col = "Blue")
```

# The Economist Data

For practice, you will try to recreate
a plot published in the Economist issue of July 20th, 2016 reflecting
the relationship between well-being and financial inclusion.

![](./figures/economist.png)

Graph source: [http://www.economist.com/blogs/graphicdetail/2016/07/daily-chart-13](http://www.economist.com/blogs/graphicdetail/2016/07/daily-chart-13)

You will generate this figure step by step through a series of included 
exercises using the tools we've just learned and will learn about. 

The data for the exercises is available in the `data/EconomistData.csv` file. 
Read it in with the following commands:

```{r}
dat <- read.csv("./data/EconomistData.csv")
head(dat)
```


The original sources for this data are:

* the Boston Consulting Group’s [report on countries' well-being](https://www.bcgperspectives.com/Images/BCG-The-Private-Sector-Opportunity-to-Improve-Well-Being-Jul-2016.pdf),
which includes Sustainable Economic Development Assessment (SEDA) scores, 
*powerful diagnostics designed to provide government leaders with a perspective on how effectively their countries convert wealth, as measured by income levels, into well-being*^[https://www.bcgperspectives.com/content/articles/growth-globalization-private-sector-opportunity-improve-well-being-2016-economic-development-assessment/] 
* the World Bank [Global Findex database](http://datatopics.worldbank.org/financialinclusion/),
which records the indices of financial inclusion, including the percent
of people aged 15 or more with a bank account.


The assignment of countries to regions is based on the `EPI_regions` column in 
the `countryExData` data.frame from the `rworldmap` package. The `Region` variable 
was matched with the categories in the Economist plot. 


# Exercise I

For the `EconomistData.csv` do the following:

1. Create a scatter plot with percent of people over the age of 15 with a bank 
account on the x axis and the SEDA score on the y axis.
2. Color the points in the previous plot blue.
3. Color the points in the previous plot according to the `Region`.
4. Create boxplots of SEDA scores by `Region`.
5. Overlay points on top of the box plots.


```{r}
# (...?)
```


# Statistical Transformations

Now, we will go back to the `diamonds` data set and return to 
the Economist plot later.

So far we have only dealt with (x, y) plots (scatter plots or 
line plots) where each point has its own (x, y) coordinates.

Sometimes, however, we are more interested in plots that require some 
statistical transformation. The transformation might map a raw data point or 
a group of data points to other values. Examples of such plots:

* boxplots, which we just generated for the Economist data,
* histograms,
* prediction lines and smoothers,
* bar charts.

Each of them relies on a different computation:

* boxplots require computing the median, the lower and upper quartiles, 
and the 1.5 * IQR whisker limits of the y-values,
* smoothers compute predicted y-values,
* histograms group the values into bins and count them, 
* bar charts count the occurrences of each class. 
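
To see what such a transformation computes, you can inspect the data that `ggplot2` builds for a layer with `ggplot_build()`. The sketch below (object names are ours) checks that the bar heights produced by `geom_bar()` are the class counts you would get from `table()`.

```{r}
library(ggplot2)

p_bar <- ggplot(diamonds, aes(x = color)) + geom_bar()

# Layer data after the statistical transformation (stat = "count") was applied
built <- ggplot_build(p_bar)$data[[1]]
built$count

# The same numbers computed by hand
by_table <- as.vector(table(diamonds$color))
by_table
```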

## Boxplots and jittered points

```{r}
ggplot(data = diamonds, aes(x = color, y =price/carat)) +
  geom_jitter()
```


```{r fig.height=3, fig.width=7}
j1 <- ggplot(data = diamonds, aes(x = color, y =price/carat)) +
  geom_jitter(alpha = I(1/5))

j2 <- ggplot(data = diamonds, aes(x = color, y =price/carat)) +
  geom_jitter(alpha = I(1/50))

j3 <- ggplot(data = diamonds, aes(x = color, y =price/carat)) +
  geom_jitter(alpha = I(1/200))

grid.arrange(j1, j2, j3, ncol = 3)
```

Here we used `grid.arrange()` from the `gridExtra` package to display 
multiple plots in the same row.

Sometimes less is more...

```{r}
ggplot(data = diamonds, aes(x = color, y =price/carat)) +
  geom_boxplot()
```

## Histogram and density plots

Below we plot the distribution of the weights (carat) of the diamonds.


```{r fig.height=3, fig.width=7}
h <- ggplot(data = diamonds, aes(x = carat)) + geom_histogram()
d <- ggplot(data = diamonds, aes(x = carat)) + geom_density()

grid.arrange(h, d, ncol = 2)
```

* For the density plot, the `adjust` argument controls the degree of smoothness
(high values of adjust produce smoother plots). 
* For the histogram, the `binwidth` or `bins` argument controls the amount 
of smoothing by setting the bin size or the number of bins. 
(Break points can also be specified explicitly, using the breaks argument.) 

<!-- It is very important to experiment with the level of smoothing. With a histogram -->
<!-- you should try many bin widths: You may find that gross features of the data -->
<!-- show up well at a large bin width, while finer features require a very narrow -->
<!-- width. -->

```{r fig.height=3, fig.width=7}
p <- ggplot(data = diamonds, aes(x = carat)) + xlim(0, 3)

h1 <- p + geom_histogram(binwidth = 1) 
h2 <- p + geom_histogram(binwidth = 0.1) 
h3 <- p + geom_histogram(binwidth = 0.01) 

grid.arrange(h1, h2, h3, ncol = 3)
```

```{r fig.height=3, fig.width=7}
d1 <- p + geom_density(adjust = 5) 
d2 <- p + geom_density(adjust = 1) 
d3 <- p + geom_density(adjust = 1/5) 

grid.arrange(d1, d2, d3, ncol = 3)
```

The histograms can be broken down into groups. Here we show grouping by diamond
cut.

```{r fig.height=3, fig.width=7}
h <- p + geom_histogram(aes(fill = cut), position = "dodge", bins = 10)
d <- p + geom_density(aes(color = cut))

grid.arrange(h, d, ncol = 2)
```

Instead of the marginal distribution, we can plot the components **stacked**
on top of each other to see the contribution from each group.

```{r fig.height=3, fig.width=7}
h <- p + geom_histogram(aes(fill = cut), position = "stack")
d <- p + geom_density(aes(fill = cut), position = "stack")

grid.arrange(h, d, ncol = 2)
```

## Bar charts

The discrete analogue of the histogram is the bar chart, `geom_bar()`. 
Instead of partitioning the values into bins like histograms, the bar
geom counts the number of instances of each discrete class. The counts
are then plotted as columns for each distinct class.

If you’d like to tabulate class members in some other way (rather than count), 
e.g. by summing up a continuous variable, you can use the `weight` aesthetic. 

```{r fig.height=3, fig.width=6.8}
b1 <- ggplot(diamonds, aes(x = color)) + geom_bar()
b2 <- ggplot(diamonds, aes(x = color)) + geom_bar(aes(weight = carat)) + ylab("carat")
grid.arrange(b1, b2, ncol = 2)
```

The left plot shows counts and the right plot is the count weighted by 
`weight = carat` to show the total weight of diamonds of each color.


Thus, you don’t need to tabulate your values beforehand, as with `barplot()` 
in base R. However, if you have already summarized your data, you can still use
`geom_bar()` with another statistical transformation, `stat = "identity"`,
rather than the default `stat = "count"`.

```{r}
diamonds.mean <- aggregate(diamonds["carat"], diamonds["color"], FUN=mean)
rbind(head(diamonds.mean), tail(diamonds.mean))
```

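The default option will generate an error, because `stat = "count"` computes the
bar heights itself and does not accept a `y` aesthetic:

```{r, error=TRUE}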
ggplot(diamonds.mean, aes(x=color, y=carat)) + 
  geom_bar()
```

Thus, you need to use the following:

```{r}
ggplot(diamonds.mean, aes(x=color, y=carat)) + 
  geom_bar(stat="identity")
```

```{r}
diamonds.sum <- aggregate(diamonds["carat"], diamonds["color"], FUN=sum)
ggplot(diamonds.sum, aes(x=color, y=carat)) +  geom_bar(stat="identity")
```

Note that this is the same plot as the one generated with the `weight`
aesthetic, which is exactly what we should expect.
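
The equivalence can also be checked numerically rather than by eye. In this sketch (variable names are ours), the bar heights computed by `ggplot2` with the `weight` aesthetic are compared with the per-color carat totals computed directly with `tapply()`.

```{r}
library(ggplot2)

# Bar heights computed by ggplot2 when each diamond is weighted by its carat
weighted <- ggplot_build(
  ggplot(diamonds, aes(x = color)) + geom_bar(aes(weight = carat))
)$data[[1]]$count

# Total carat per color computed directly
by_hand <- as.vector(tapply(diamonds$carat, diamonds$color, sum))

all.equal(weighted, by_hand)
```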


## Prediction lines

We can add a regression line to the plot by simply
adding a line with the fitted y-values from a prediction model:

```{r}
dsmall$pred.price <- predict(lm(price ~ carat, data = dsmall))
p1 <- ggplot(dsmall, aes(x = carat, y = price))
p1 + geom_point(aes(color = color)) + geom_line(aes(y = pred.price))
```
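
The same approach extends to uncertainty bands. As a sketch, `predict()` for a linear model can also return a confidence interval, which `geom_ribbon()` can draw around the fitted line. Here `d200` is a stand-in for `dsmall`, and `fit` and `band` are names introduced only for this example.

```{r}
library(ggplot2)

set.seed(12345)
d200 <- as.data.frame(diamonds)[sample(nrow(diamonds), 200), ]

fit <- lm(price ~ carat, data = d200)

# With interval = "confidence", predict() returns columns fit, lwr and upr
band <- as.data.frame(predict(fit, interval = "confidence"))
d200 <- cbind(d200, band)

ggplot(d200, aes(x = carat, y = price)) +
  geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.3) +
  geom_point(aes(color = color)) +
  geom_line(aes(y = fit))
```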


## Smoothers

If you have a scatterplot with many data points, it can be hard to see exactly
what trend is shown by the data. In this case you may want to add a smoothed
line to the plot. The smooth geom includes a line and a ribbon.


```{r}
ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point() + geom_smooth()
```

For our small subset of the diamonds data set we have:

```{r}
p1 + geom_point() + geom_smooth()
```

By changing the `span` argument, we can obtain a more or less wiggly curve
(a smaller `span` results in more wiggliness).

```{r}
grid.arrange(p1 + geom_point() + geom_smooth(span = 0.2),
             p1 + geom_point() + geom_smooth(span = 0.7), ncol = 2)
```


* The default method used in `geom_smooth()` with a small number
of observations (`n < 1000`) is `method = "loess"`, which fits a smooth local 
regression. More details about the algorithm can be found in `?loess`. 
* Loess does not work well for large data sets (it’s $O(n^2)$ in memory), so
an alternative smoothing algorithm (a generalized additive model, 
`method = "gam"`) is used when n is 1,000 or more.


# Exercise II

1. Re-create a scatter plot with percent of people aged 15+ with a bank account
on the x axis and SEDA current level score on the y axis 
(as you did in the previous exercise).
2. Overlay a smoothing line on top of the scatter plot using the lm method. 
Hint: see `?stat_smooth`.
3. Overlay a smoothing line on top of the scatter plot using the default method.
4. Overlay a smoothing line on top of the scatter plot using the default loess 
method, but make it less smooth. Hint: see `?loess`.

```{r}
# (...?)
```


# Scales


## Aesthetic Mapping Variable Scaling

Aesthetic mapping (i.e., with aes()) is responsible for assigning an aesthetic 
to a variable. It doesn't, however, specify how the mapping should be done. 

For example, `aes(shape = x)` or `aes(color = z)` do not specify what shapes
or what colors should be used. To choose colors/shapes/sizes etc. you need
to modify the corresponding scale. 

**In ggplot2 scales include:**

* position
* color and fill
* size
* shape
* line type

Scales are modified with a series of functions using a 
`scale_<aesthetic>_<type>` naming scheme. Try typing `scale_<tab>` to see 
a list of scale modification functions.

**Common Scale Arguments:**

* **name**: the first argument gives the axis or legend title
* **limits**: the minimum and maximum of the scale
* **breaks**: the points along the scale where labels should appear
* **labels**: the labels that appear at each break
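
A minimal sketch of these four arguments used together on the x-axis. The tick labels and the object names (`x_scale`, `d200`) are invented for illustration; note that points outside `limits` are dropped with a warning.

```{r}
library(ggplot2)

# One scale object carrying a title, limits, tick positions and tick labels
x_scale <- scale_x_continuous(name = "Weight (carat)",
                              limits = c(0, 3),
                              breaks = c(0, 1, 2, 3),
                              labels = c("0", "1 ct", "2 ct", "3 ct"))

set.seed(12345)
d200 <- diamonds[sample(nrow(diamonds), 200), ]  # stand-in for `dsmall`
ggplot(d200, aes(x = carat, y = price)) + geom_point() + x_scale
```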

## Scale: axes

Square root transformation of the y-axis:

```{r}
p1 <- ggplot(dsmall, aes(x = carat, y = price)) 
p1 + geom_point() + scale_y_sqrt()
```

Log base 10 transformation of the y-axis:

```{r}
p1 + geom_point() + scale_y_log10()
```

Log base 10 transformation of x and y axes:

```{r}
p1 + geom_point() + scale_y_log10() + scale_x_log10()
```

Note that the above produces the same points as:
```{r}
ggplot(dsmall, aes(x = log(carat), y = log(price))) + geom_point()
```

but with different values on the axes.
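
The point pattern is identical because the natural log and the base-10 log differ only by a constant factor, so the two versions are rescaled copies of each other. A quick numeric check on a few made-up carat values:

```{r}
carat <- c(0.3, 1, 2.5)

# Natural log and base-10 log differ only by the constant factor log(10)
same_shape <- all.equal(log(carat), log10(carat) * log(10))
same_shape

# scale_x_log10() keeps the original units on the axis: the tick labelled 1
# sits at position log10(1), which a plot of log(carat) would label 0
log10(1)
```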

## Scale: shapes

```{r}
p1 + geom_point(aes(shape = cut), size = 3)  
```

```{r}
p1 + geom_point(aes(shape = cut), size = 3) + 
  scale_shape_manual(values = c(1:5))
```

## Scale: colors

To choose specific colors for **discrete** variables we can use `scale_color_manual`

```{r}
p1 + geom_point(aes(color = cut), size = 3) + 
  scale_color_manual(values = c("red", "orange", "yellow", "green", "blue"))
```

For **continuous** variables you can also use `scale_color_gradient`, and specify
the ends of the spectrum:

```{r}
p1 + geom_point(aes(color = price), size = 3) + 
  scale_color_gradient(low = "blue", high = "red")
```

`scale_color_brewer` is a very useful function that can be used to set
colors for **discrete** variables. It gives you a choice between many predefined
and pretty color palettes.

```{r}
p1 + geom_point(aes(color = cut), size = 3) + 
  scale_color_brewer(palette = "Set2")
```

Unfortunately `scale_color_brewer` doesn't work for continuous variables:

```{r, error=TRUE}
p1 + geom_point(aes(color = price), size = 3) + 
  scale_color_brewer(palette = "Spectral")
```

Thankfully, we can get around this issue using the `RColorBrewer` package
and using `scale_color_gradientn`:

```{r}
library(RColorBrewer)
p1 + geom_point(aes(color = price), size = 3) + 
  scale_color_gradientn(colours = brewer.pal(name = "Spectral", n = 10))
```
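
`ggplot2` also provides `scale_color_distiller()`, which interpolates a ColorBrewer palette for continuous data in a single call. A sketch, with `d200` standing in for `dsmall` and `p_distiller` as an illustrative name:

```{r}
library(ggplot2)

set.seed(12345)
d200 <- diamonds[sample(nrow(diamonds), 200), ]  # stand-in for `dsmall`

# The palette colors are interpolated into a continuous gradient
p_distiller <- ggplot(d200, aes(x = carat, y = price)) +
  geom_point(aes(color = price), size = 3) +
  scale_color_distiller(palette = "Spectral")
p_distiller
```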


... and if you are a true indie person, you can even check out
the color schemes based on [Wes Anderson](http://wesandersonpalettes.tumblr.com/)
movies:

```{r}
#install.packages("wesanderson")
library(wesanderson)
names(wes_palettes)
```

For discrete:

```{r}
p1 + geom_point(aes(color = cut), size = 3) + 
  scale_color_manual(values = wes_palette("Darjeeling", n = 5))
```

For continuous:

```{r}
p1 + geom_point(aes(color = price), size = 3) + 
  scale_color_gradientn(colours = wes_palette("Darjeeling", 100, type = "continuous"))
```



You can also transform the values of the variable mapped to color:

```{r}
p1 + geom_point(aes(color = price), size = 3) + 
  scale_color_gradient(low = "blue", high = "red", trans = "log10")
```

and add your own breaks:


```{r}
p1 + geom_point(aes(color = price), size = 3) + 
  scale_color_gradient(low = "blue", high = "red", trans = "log10",
                       breaks = c(1000, 2000, 5000, 10000),
                       labels = c("  1000", "  2000", "  5000", "10000")) 
  
```


# Exercise III

1. For the scatter plot of the percent of people aged 15+ with a bank account 
vs SEDA score, colored by region and generated in Exercise I.3, modify the 
color scale to use specific values of your choosing. Hint: see `?scale_color_manual`.

```{r}
# (...?)
```

# Faceting

```{r}
p <-  ggplot(diamonds, aes(x = carat))
p + geom_histogram()
```

```{r}
p + geom_histogram() + facet_wrap(~ color)
```

```{r fig.width=8, fig.height=6}
p + geom_histogram() + facet_grid(cut ~ color)
```

```{r}
(p <- ggplot(data = diamonds[sample(nrow(diamonds), 1000), ], 
             aes(x = carat, y = price)) +
  geom_point(aes(text = paste("Clarity:", clarity)), size = 1) +
  geom_smooth(aes(colour = cut, fill = cut)) + facet_wrap(~ cut))
```
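
In the examples above, `facet_wrap(~ color)` draws one panel per level of a single variable, while `facet_grid(cut ~ color)` lays panels out in rows and columns defined by two variables. By default all panels share the same axes; the sketch below (with `p_facets` as an illustrative name) uses the `ncol` and `scales` arguments of `facet_wrap()` to control the layout and give each panel its own y-axis range.

```{r}
library(ggplot2)

p_facets <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.1) +
  # one panel per cut, laid out in 2 columns, each with its own y-axis range
  facet_wrap(~ cut, ncol = 2, scales = "free_y")
p_facets
```

Free scales make the shape of each distribution easier to read, at the cost of making bar heights harder to compare across panels.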


# Exercise IV

1. Facet the Economist plot from Exercise III by region (`~ Region`).

```{r}
# (... ?)
```


# Finish the Economist plot.

To complete the graph we need to:

* add a trend line
* change the axis labels
* change the order of the Region labels
* change the coloring of the points
* label selected points
* change the color legend's position
* adjust the axes ratio
* fix the tick marks
* match the plot's theme with the Economist theme
* add notes

## Change order of the Regions

```{r}
dat$Region <- as.character(dat$Region)
dat$Region <- factor(dat$Region, 
                     levels = c("Europe", "Asia", "Oceania", 
                                "North America", 
                                "Latin America & the Caribbean", 
                                "Middle East & North Africa",
                                "Sub-Saharan Africa"),
                     labels = c("Europe", "Asia", "Oceania", 
                                "North America", 
                                "Latin America & \n the Caribbean", 
                                "Middle East & \n North Africa",
                                "Sub-Saharan \n Africa"))
```
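
In `factor()`, `levels` fixes the order of the categories (and therefore the legend order), while `labels` sets the names that are displayed. A toy sketch of the same pattern on a made-up vector (`region` and `region_f` are illustrative names):

```{r}
region <- c("Asia", "Europe", "Sub-Saharan Africa", "Europe")

# `levels` sets the order, `labels` sets the displayed names
region_f <- factor(region,
                   levels = c("Europe", "Asia", "Sub-Saharan Africa"),
                   labels = c("Europe", "Asia", "Sub-Saharan \n Africa"))

levels(region_f)      # display names, in legend order
as.integer(region_f)  # positions follow `levels`, not the alphabet
```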

```{r}
ggplot(dat, aes(Percent.of.15plus.with.bank.account, SEDA.Current.level)) + 
  geom_point(aes(color = Region))
```

## Add the linear trend

```{r}
# (...?)
```

## Change the axes ratio.

Hint: see `?coord_fixed`.

```{r}
# (...?)
```

## Change the color scheme

Use the following colors:
`c("#28AADC","#F2583F", "#76C0C1","#24576D", "#248E84","#DCC3AA", "#96503F")`

```{r}
# (...?)
```


## Add a title and format the axes

Check `?scale_x_continuous` and `?scale_y_continuous`
documentation. To add a title use `ggtitle()`.
```{r}
# (...?)
```

## Change the background and theme

You can check out the [`ggthemes`](https://cran.r-project.org/web/packages/ggthemes/vignettes/ggthemes.html) 
package, which implements themes that make your plots look like they came from:

* Base graphics
* Tableau
* Excel
* Stata
* the Economist
* Wall Street Journal
* Edward Tufte
* Nate Silver's Fivethirtyeight
* etc.

Use the one that mimics the Economist.

```{r}
library(ggthemes)
# (...?)
```

## Format the legend

Use `theme()` with arguments such as 
`legend.position`, `legend.direction`, `legend.text`, and `plot.margin`.

```{r, fig.width=9, fig.height=5}
# (...?)
```


## Add point labels

Add labels to the following subset of the countries:

```{r}
pointsToLabel <- c("Yemen", "Iraq", "Egypt", "Jordan", "Chad", "Congo", 
                   "Angola", "Albania", "Zimbabwe", "Uganda", "Nigeria",
                   "Uruguay", "Kazakhstan", "India", "Turkey", "South Africa",
                   "Kenya", "Russia", "Brazil", "Chile", "Saudi Arabia", 
                   "Poland", "China", "Serbia", "United States", "United Kingdom")
```

Use `geom_text_repel()`

```{r, fig.width=9, fig.height=5}
# (...?)
```

## Add notes to the bottom and save the plot

Use `grid.text()`

```{r}
# (...?)
```



# `plotly` for interactive plotting

You can also easily generate an interactive plot by calling the `ggplotly()`
function from the `plotly` package.

```{r}
#install.packages("plotly")
```


```{r, fig.width=9, fig.height=5}
library(plotly)
#ggplotly(pEconomist)
```


Similar to the original:

![](./figures/economist.png)


# What we have learned so far:

* 2D plotting in R can be done with:
    + base graphics or `ggplot2`
* The building blocks of `ggplot2`
* How to generate:
    + line plots
    + scatter plots
    + histograms
    + bar plots
    + boxplots
    + prediction/trend lines or smoothers
* How to modify the aesthetics settings:
    + coloring scheme
    + shapes
* How to use themes to automatically change the style of a plot.
* How to facet a plot to display the information for subsets of the data 
defined by the values of a specific attribute.

# Additional resources:

* Hadley Wickham's [R for Data Science](http://r4ds.had.co.nz/): Chapter 3. Data Visualization
* Hadley Wickham's [ggplot2: Elegant Graphics for Data Analysis (Use R!)](http://ggplot2.org/book/) 
* Wiki: https://github.com/hadley/ggplot2/wiki
* `plotly` [github](https://github.com/ropensci/plotly)
